Sampling and inference for discrete random probability measures in probabilistic programs

Authors

  • Benjamin Bloem-Reddy
  • Emile Mathieu
  • Adam Foster
  • Tom Rainforth
  • Yee Whye Teh
  • María Lomelí
  • Zoubin Ghahramani
Abstract

We consider the problem of sampling a sequence from a discrete random probability measure (RPM) with countable support, under (probabilistic) constraints of finite memory and computation. A canonical example is sampling from the Dirichlet Process, which can be accomplished using its stick-breaking representation and lazy initialization of its atoms. We show that efficient lazy initialization is possible if and only if a size-biased representation of the discrete RPM is used. For models constructed from such discrete RPMs, we consider the implications for generic particle-based inference methods in probabilistic programming systems. To demonstrate, we implement SMC for Normalized Inverse Gaussian Process mixture models in Turing.

Bayesian non-parametric (BNP) models are a powerful and flexible class of methods for carrying out Bayesian analysis [25]. By allowing an unbounded number of parameters, BNP models can adapt to the data, providing an increasingly complex representation as more data becomes available. However, a major drawback of BNP modeling is that the resulting inference problems are often challenging, meaning that many models require custom-built inference schemes that are difficult and time-consuming to design, thereby hampering the development and implementation of new models and applications. Probabilistic programming systems (PPSs) [e.g., 11; 42; 14] have the potential to alleviate this problem by providing an expressive modeling framework and automating the required inference, making powerful statistical methods accessible to non-experts. Universal probabilistic programming languages [11; 29] may be particularly useful in the context of BNP modeling [e.g., 5] because they allow the number of parameters to vary stochastically and provide automated inference algorithms to suit [40; 42]. Currently, most systems only provide explicit support for Dirichlet Processes (DPs) [8; 35] or direct extensions thereof (see Section 1.2).

The contributions of this paper are threefold. Firstly, we introduce the concept of the laziest initialization of a discrete RPM, which provides a computationally and inferentially efficient representation of the RPM suitable for a PPS. We show that this can be carried out only when the atoms of the RPM have a size-biased representation. Secondly, we derive the probability distribution of the number of variables initialized by the commonly used recursive coin-flipping implementation of the DP and its heavy-tailed extension, the Pitman–Yor Process (PYP). We show that this number has finite expectation for only part of its parameter range, indicating that although the coin-flipping recursion halts with probability 1, and is thus computable, it may be undesirable for practical purposes. Finally, we demonstrate posterior inference for Normalized Inverse Gaussian Process (NIGP) mixture models using Turing [9]. To our knowledge, this is the first BNP mixture model other than the DP or PYP for which posterior inference has been carried out in a PPS.

∗Equal contribution. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Size-biased representations and lazy initialization

A discrete random probability measure (RPM) P on a measurable space (W, Σ) is a countable collection of probability weights (P_j)_{j≥1} such that ∑_{j≥1} P_j = 1 a.s., and atoms (Ω_j)_{j≥1} ∈ W such that P(A) = ∑_{j≥1} P_j δ_{Ω_j}(A) a.s. for any A ∈ Σ.
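As a concrete illustration of this definition (purely a sketch of ours, not code from the paper: the class name DiscreteRPM and the finite truncation are illustrative assumptions), a finitely supported discrete RPM can be stored as a weight vector summing to one together with its atoms, and P(A) evaluated by summing the weights of the atoms falling in A:

```python
import numpy as np

# Illustrative sketch only: a finitely supported discrete RPM with
# weights (P_j) summing to 1 and atoms (Omega_j) in W = R.
class DiscreteRPM:
    def __init__(self, weights, atoms):
        self.weights = np.asarray(weights, dtype=float)
        self.atoms = np.asarray(atoms, dtype=float)
        assert np.isclose(self.weights.sum(), 1.0)   # sum_j P_j = 1

    def measure(self, A):
        """P(A) = sum_j P_j * delta_{Omega_j}(A), with A given as an indicator function."""
        mask = np.array([bool(A(w)) for w in self.atoms])
        return float(self.weights[mask].sum())

# A three-atom example: P([0, 1.5]) collects the weight of the atoms at 0.2 and 1.0.
P = DiscreteRPM(weights=[0.5, 0.3, 0.2], atoms=[0.2, 1.0, 4.0])
print(P.measure(lambda w: 0.0 <= w <= 1.5))          # -> 0.8
```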
A size-biased permutation π of P = (P_j, Ω_j)_{j≥1}, denoted P̃ = (P̃_j, Ω̃_j)_{j≥1}, is a random permutation of the atoms of P such that [28]

$$(\tilde{P}_1, \tilde{\Omega}_1) = (P_{\pi(1)}, \Omega_{\pi(1)}), \quad \text{where } \mathbb{P}(\pi(1) = j \mid P_1, P_2, \ldots) = P_j, \tag{1}$$

$$(\tilde{P}_2, \tilde{\Omega}_2) = (P_{\pi(2)}, \Omega_{\pi(2)}), \quad \text{where } \mathbb{P}(\pi(2) = j \mid \pi(1), P_1, P_2, \ldots) = \frac{P_j}{1 - \tilde{P}_1},$$

and so on. If a discrete RPM P̃ is equal in distribution to first sampling a realization P ∼ μ and then applying (1), we say that P̃ is a size-biased version of P. P is said to be lazily initialized by a computer program if each atom of P is not instantiated in memory until the first time it is needed by the program; denote by P̂_k the first k atoms generated by the program, and say that P̂ is a lazy size-biased version of P if P̂_k is equal in distribution to P̃_k for all k ∈ N.

Let X := (X_1, X_2, ...) be a sequence taking values in W, X_n = (X_1, ..., X_n) its size-n prefix, and K_n the number of unique values in X_n. P̃ is defined to be induced by X if P̃ is realized by labeling the atoms of P in the order in which they appear in the sample X_1, X_2, ... ∼ P. The following examples illustrate the concept.

Sampling the DP by recursive coin-flipping. The stick-breaking construction of the DP [34; 16] yields a simple way to generate atoms of P when P is drawn from a DP prior with concentration parameter θ > 0 and base measure H_0 (assumed to be non-atomic). X_n can be sampled as follows:

Algorithm 1: Recursive coin-flipping for sampling from the DP

    M = 0                                 ▷ tracks the number of atoms initialized
    for i = 1, ..., n do                  ▷ iterate over observations
        j = 0, coin = 0
        while coin == 0 do                ▷ recursively (in j) flip V_j-coins until the first heads
            j = j + 1
            if j > M then                 ▷ instantiate V_j and Ω_j when necessary
                V_j ∼ Beta(1, θ), Ω_j ∼ H_0
                M = M + 1
            end if
            coin ∼ Bernoulli(V_j)         ▷ flip a V_j-coin
        end while
        X_i = Ω_j                         ▷ X_i takes the value of the atom at the first heads
    end for
    return X_n

A random number M_n of atoms is generated as needed by the program (equal to M when X_n is returned). With positive probability, M_n is larger than K_n, the number of unique values in X_n.

Sampling the DP by induced size-biased representation. The stick-breaking construction of the DP is distributionally equivalent to the size-biased representation of the DP [26; 28]: $\tilde{P}_j \overset{d}{=} V_j \prod_{i=1}^{j-1} (1 - V_i)$ jointly for each j. Hence, the predictive distribution of X_{n+1} given P̃_{K_n} is

$$\mathbb{P}[X_{n+1} \in \cdot \mid \tilde{P}_{K_n}] = \sum_{j=1}^{K_n} \tilde{P}_j \, \delta_{\tilde{\Omega}_j}(\cdot) + \Big(1 - \sum_{j=1}^{K_n} \tilde{P}_j\Big) H_0(\cdot). \tag{2}$$

X can be sampled from P using (2): if X_{n+1} belongs to a new category (which happens with probability $1 - \sum_{j=1}^{K_n} \tilde{P}_j$), then $X_{n+1} = \tilde{\Omega}_{K_n+1} \sim H_0$, $V_{K_n+1} \sim \mathrm{Beta}(1, \theta)$, and $\tilde{P}_{K_n+1} = V_{K_n+1} \prod_{j=1}^{K_n} (1 - V_j)$. In this way, only the first K_n atoms of the size-biased representation are generated, corresponding to those chosen by the elements of X_n. Therefore, K_n ≤ M_n with probability 1, and P̃_{K_n} is induced by X_n. If the atom weights P̃_{K_n} in (2) are marginalized with respect to the distribution of P̃_{K_n} | X_n, another prediction rule, or urn scheme, can be used to sample X_n; in the case of the DP, this gives rise to the Chinese Restaurant Process (CRP) [28], which can be used in the same way to sample X. An illustrative code sketch contrasting the coin-flipping and size-biased samplers is given below.

1.1 The laziest initialization

We define a lazy initialization scheme P̂ for a discrete RPM P to be minimal with respect to X if, with probability 1 for each n ∈ N+, the number of initialized atoms is K_n and the mapping
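Returning to the two DP sampling schemes above, the following Python/NumPy sketch (our own illustration, not the paper's Turing implementation; the function names, the callable base measure H0, and the parameter values are assumptions made for the example) contrasts recursive coin-flipping, which may instantiate M_n ≥ K_n sticks, with lazy size-biased sampling via the predictive rule (2), which instantiates exactly K_n atoms:

```python
import numpy as np

def dp_coin_flipping(n, theta, H0, rng):
    """Recursive coin-flipping sampler (Algorithm 1): may instantiate M_n >= K_n sticks."""
    V, atoms, X = [], [], []
    for _ in range(n):
        j, coin = -1, 0
        while coin == 0:
            j += 1
            if j >= len(V):                    # instantiate V_j and Omega_j only when needed
                V.append(rng.beta(1.0, theta))
                atoms.append(H0(rng))
            coin = rng.binomial(1, V[j])       # flip a V_j-coin
        X.append(atoms[j])                     # X_i is the atom at the first heads
    return X, len(V)                           # samples X_n and M_n

def dp_size_biased(n, theta, H0, rng):
    """Lazy size-biased sampler via predictive rule (2): instantiates exactly K_n atoms."""
    P_tilde, atoms, X = [], [], []
    remaining = 1.0                            # 1 - sum_j P_tilde_j, mass not yet assigned
    for _ in range(n):
        probs = np.array(P_tilde + [remaining])
        j = rng.choice(len(probs), p=probs / probs.sum())
        if j == len(P_tilde):                  # new category, drawn from H_0
            V = rng.beta(1.0, theta)
            P_tilde.append(V * remaining)      # P_tilde_{K+1} = V_{K+1} * prod_{i<=K} (1 - V_i)
            remaining *= 1.0 - V
            atoms.append(H0(rng))
        X.append(atoms[j])
    return X, len(atoms)                       # samples X_n and K_n

rng = np.random.default_rng(0)
H0 = lambda rng: rng.normal()                  # stand-in for a non-atomic base measure
X_cf, M_n = dp_coin_flipping(100, theta=2.0, H0=H0, rng=rng)
print(M_n, len(set(X_cf)))                     # M_n vs. K_n for the same draw: M_n >= K_n
X_sb, K_n = dp_size_biased(100, theta=2.0, H0=H0, rng=rng)
print(K_n == len(set(X_sb)))                   # exactly K_n atoms are ever initialized
```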

Similar resources

Measure Transformer Semantics for Bayesian Machine Learning

The Bayesian approach to machine learning amounts to computing posterior distributions of random variables from a probabilistic model of how the variables are related (that is, a prior distribution) and a set of observations of variables. There is a trend in machine learning towards expressing Bayesian models as probabilistic programs. As a foundation for this kind of programming, we propose a ...

Learning Stochastic Inverses for Adaptive Inference in Probabilistic Programs

We describe an algorithm for adaptive inference in probabilistic programs. During sampling, the algorithm accumulates information about the local probability distributions that compose the program’s overall distribution. We use this information to construct targeted samples: given a value for an intermediate expression, we stochastically invert each of the steps giving rise to this value, sampl...

Hybrid Probabilistic Search Methods for Simulation Optimization

Discrete-event simulation based optimization is the process of finding the optimum design of a stochastic system when the performance measure(s) could only be estimated via simulation. Randomness in simulation outputs often challenges the correct selection of the optimum. We propose an algorithm that merges Ranking and Selection procedures with a large class of random search methods for continu...

Using probabilistic programs as proposals

Monte Carlo inference has asymptotic guarantees, but can be slow when using generic proposals. Handcrafted proposals that rely on user knowledge about the posterior distribution can be efficient, but are difficult to derive and implement. This paper proposes to let users express their posterior knowledge in the form of proposal programs, which are samplers written in probabilistic programming l...

Semantics Sensitive Sampling for Probabilistic Programs

We present a new semantics sensitive sampling algorithm for probabilistic programs, which are “usual” programs endowed with statements to sample from distributions, and condition executions based on observations. Since probabilistic programs are executable, sampling can be performed by repeatedly executing them. However, in the case of programs with a large number of random variables and observ...


Publication date: 2017